New York Taxi Fare Dataset¶
In this notebook we study the New York Taxi Fare Dataset (see the Kaggle page). We begin with data exploration, followed by feature engineering, data cleaning and visualization, and finally test different models to predict the fare price.
Disclaimer: while running this notebook I ran into problems due to the size of the dataset, so I decided to analyze only 100,000 rows of it, and then focus on the year with the highest traffic.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly_express as px
import seaborn as sns
import folium
palette = sns.color_palette("rainbow", 8)
1. Data Exploration¶
df = pd.read_csv('./data/train.csv', nrows= 100000)
df.head(10)
| | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 2009-06-15 17:26:21.0000001 | 4.5 | 2009-06-15 17:26:21 UTC | -73.844311 | 40.721319 | -73.841610 | 40.712278 | 1 |
| 1 | 2010-01-05 16:52:16.0000002 | 16.9 | 2010-01-05 16:52:16 UTC | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 1 |
| 2 | 2011-08-18 00:35:00.00000049 | 5.7 | 2011-08-18 00:35:00 UTC | -73.982738 | 40.761270 | -73.991242 | 40.750562 | 2 |
| 3 | 2012-04-21 04:30:42.0000001 | 7.7 | 2012-04-21 04:30:42 UTC | -73.987130 | 40.733143 | -73.991567 | 40.758092 | 1 |
| 4 | 2010-03-09 07:51:00.000000135 | 5.3 | 2010-03-09 07:51:00 UTC | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 |
| 5 | 2011-01-06 09:50:45.0000002 | 12.1 | 2011-01-06 09:50:45 UTC | -74.000964 | 40.731630 | -73.972892 | 40.758233 | 1 |
| 6 | 2012-11-20 20:35:00.0000001 | 7.5 | 2012-11-20 20:35:00 UTC | -73.980002 | 40.751662 | -73.973802 | 40.764842 | 1 |
| 7 | 2012-01-04 17:22:00.00000081 | 16.5 | 2012-01-04 17:22:00 UTC | -73.951300 | 40.774138 | -73.990095 | 40.751048 | 1 |
| 8 | 2012-12-03 13:10:00.000000125 | 9.0 | 2012-12-03 13:10:00 UTC | -74.006462 | 40.726713 | -73.993078 | 40.731628 | 1 |
| 9 | 2009-09-02 01:11:00.00000083 | 8.9 | 2009-09-02 01:11:00 UTC | -73.980658 | 40.733873 | -73.991540 | 40.758138 | 2 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   key                100000 non-null  object
 1   fare_amount        100000 non-null  float64
 2   pickup_datetime    100000 non-null  object
 3   pickup_longitude   100000 non-null  float64
 4   pickup_latitude    100000 non-null  float64
 5   dropoff_longitude  100000 non-null  float64
 6   dropoff_latitude   100000 non-null  float64
 7   passenger_count    100000 non-null  int64
dtypes: float64(5), int64(1), object(2)
memory usage: 6.1+ MB
df.describe()
| | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|
| count | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.000000 |
| mean | 11.354652 | -72.494682 | 39.914481 | -72.490967 | 39.919053 | 1.673820 |
| std | 9.716777 | 10.693934 | 6.225686 | 10.471386 | 6.213427 | 1.300171 |
| min | -44.900000 | -736.550000 | -74.007670 | -84.654241 | -74.006377 | 0.000000 |
| 25% | 6.000000 | -73.992041 | 40.734996 | -73.991215 | 40.734182 | 1.000000 |
| 50% | 8.500000 | -73.981789 | 40.752765 | -73.980000 | 40.753243 | 1.000000 |
| 75% | 12.500000 | -73.966982 | 40.767258 | -73.963433 | 40.768166 | 2.000000 |
| max | 200.000000 | 40.787575 | 401.083332 | 40.851027 | 404.616667 | 6.000000 |
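The summary above reveals impossible coordinates (a minimum pickup longitude of -736.55, a maximum dropoff latitude of 404.62). A minimal sketch of how such rows could be flagged with `Series.between`, using a hypothetical mini-frame rather than the real dataset:

```python
import pandas as pd

# Hypothetical rows illustrating the out-of-range coordinates
# visible in describe() (e.g. a pickup_longitude of -736.55)
sample = pd.DataFrame({
    "pickup_longitude": [-73.98, -736.55, -74.01],
    "pickup_latitude": [40.75, 40.72, 401.08],
})

# NYC roughly spans longitude [-75, -72] and latitude [40, 42]
valid = sample["pickup_longitude"].between(-75, -72) & \
        sample["pickup_latitude"].between(40, 42)
print(valid.sum())  # number of rows with plausible coordinates
```

The same mask, negated, could be used with `df.drop` to discard the implausible rows before computing distances.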
df.isnull().sum()
key                  0
fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
dtype: int64
2. Feature Engineering¶
df_copy = df.copy()
## A little bit of cleaning: drop the duplicates and nulls before starting the analysis
df_copy = df_copy.dropna()
df_copy = df_copy.drop_duplicates()
df_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   key                100000 non-null  object
 1   fare_amount        100000 non-null  float64
 2   pickup_datetime    100000 non-null  object
 3   pickup_longitude   100000 non-null  float64
 4   pickup_latitude    100000 non-null  float64
 5   dropoff_longitude  100000 non-null  float64
 6   dropoff_latitude   100000 non-null  float64
 7   passenger_count    100000 non-null  int64
dtypes: float64(5), int64(1), object(2)
memory usage: 6.1+ MB
df_copy['pickup_datetime'] = pd.to_datetime(df_copy['pickup_datetime'], format= "%Y-%m-%d %H:%M:%S UTC")
# Extract time features from pickup_datetime
df_copy['year'] = df_copy.pickup_datetime.apply(lambda t: t.year)
df_copy['weekday'] = df_copy.pickup_datetime.apply(lambda t: t.weekday())
df_copy['hour'] = df_copy.pickup_datetime.apply(lambda t: t.hour)
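The same features can also be extracted with pandas' vectorized `.dt` accessor, which avoids calling a Python-level lambda on every row. A small self-contained sketch using two timestamps from the table above:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2012-04-21 04:30:42", "2009-06-15 17:26:21"]))
years = s.dt.year        # vectorized, no per-row lambda
weekdays = s.dt.weekday  # Monday == 0, consistent with .weekday()
hours = s.dt.hour
print(list(years), list(weekdays), list(hours))
```

On a 100,000-row frame the `.dt` accessor is typically noticeably faster than `apply` with a lambda.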
def distance(lat1, lon1, lat2, lon2):
    '''Receive latitude and longitude coordinates and calculate the distance
    between the points using the Haversine formula. Returns miles.
    float, float, float, float --> float'''
    p = 0.017453292519943295  # pi/180, converts degrees to radians
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    # 12742 km is the Earth's diameter; 0.6213712 converts km to miles
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))
df_copy['distance'] = distance(df_copy.pickup_latitude, df_copy.pickup_longitude,
df_copy.dropoff_latitude, df_copy.dropoff_longitude)
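As a quick sanity check of the Haversine helper, one degree of longitude at the equator is about 111.19 km, i.e. roughly 69.1 miles:

```python
import numpy as np

def distance(lat1, lon1, lat2, lon2):
    '''Haversine distance between two points, in miles.'''
    p = 0.017453292519943295  # pi/180, degrees to radians
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2)
    # 12742 km is the Earth's diameter; 0.6213712 converts km to miles
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))

# One degree of longitude at the equator: expect ~69.1 miles
print(round(float(distance(0.0, 0.0, 0.0, 1.0)), 1))
```

Note the function accepts NumPy arrays as well as scalars, which is what lets it be applied to whole DataFrame columns above.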
3. Data Cleaning¶
Note that we have negative values in fare_amount, so we need to exclude them; the columns 'key' and 'pickup_datetime' are no longer necessary because of our new columns¶
df_copy = df_copy[df_copy.fare_amount > 0]
df_copy = df_copy[df_copy.distance > 0]
df_copy = df_copy.drop(['key', 'pickup_datetime'], axis=1)
df_copy.head()
| | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count | year | weekday | hour | distance |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.5 | -73.844311 | 40.721319 | -73.841610 | 40.712278 | 1 | 2009 | 0 | 17 | 0.640487 |
| 1 | 16.9 | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 1 | 2010 | 1 | 16 | 5.250670 |
| 2 | 5.7 | -73.982738 | 40.761270 | -73.991242 | 40.750562 | 2 | 2011 | 3 | 0 | 0.863411 |
| 3 | 7.7 | -73.987130 | 40.733143 | -73.991567 | 40.758092 | 1 | 2012 | 5 | 4 | 1.739386 |
| 4 | 5.3 | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 | 2010 | 1 | 7 | 1.242218 |
4. Data Visualisation¶
Visualising the geospatial locations for the pickup points.¶
from folium.plugins import HeatMap
# folium map created
m = folium.Map(location=[40.75, -73.98], zoom_start=11, tiles='CartoDB positron')
# creating the heat layer
heat_data = [[row['pickup_latitude'], row['pickup_longitude'], row['fare_amount']] for _, row in df_copy.iterrows()]
HeatMap(heat_data, radius=15, max_zoom=13).add_to(m)
m
Visualising the geospatial locations for the dropoff points¶
m_dropoff = folium.Map(location=[40.75, -73.98], zoom_start=11, tiles='CartoDB positron')
heat_data_dropoff = [[row['dropoff_latitude'], row['dropoff_longitude'], row['fare_amount']] for _, row in df_copy.iterrows()]
HeatMap(heat_data_dropoff, radius=15, max_zoom=13).add_to(m_dropoff)
m_dropoff
Histogram plot of fare price¶
plt.style.use('ggplot')
plt.figure(figsize=(12, 5))
plt.hist(df_copy['fare_amount'], bins=100, color='skyblue')
plt.xlabel("Fare ($)")
plt.ylabel("Amount")
plt.title("Histogram of Fare ($)")
plt.show()
Barplot for visualizing the number of rides in the following years¶
year_counts = df_copy['year'].value_counts()
plt.figure(figsize=(15, 4))
plt.bar(year_counts.index, year_counts.values, color=palette)
plt.ylabel("Ride Count")
plt.xlabel("Year")
plt.title("Annual Ride Distribution")
plt.show()
Traffic in the year 2012¶
year2012_insight = df_copy[df_copy['year'] == 2012]
xlim = [-74.03, -73.85]
ylim = [40.70, 40.85]
year2012_insight = year2012_insight.query(
"pickup_longitude > @xlim[0] and pickup_longitude < @xlim[1] and "
"dropoff_longitude > @xlim[0] and dropoff_longitude < @xlim[1] and "
"pickup_latitude > @ylim[0] and pickup_latitude < @ylim[1] and "
"dropoff_latitude > @ylim[0] and dropoff_latitude < @ylim[1]"
)
fig, axes = plt.subplots(1, 2, figsize=(15, 7))
axes[0].plot(year2012_insight.dropoff_longitude, year2012_insight.dropoff_latitude, 'o', alpha=.5, markersize=2, color="#fff", markeredgecolor='#000', markeredgewidth=1.5)
axes[0].plot(year2012_insight.pickup_longitude, year2012_insight.pickup_latitude, '.', alpha=.8, markersize=.5, color="red")
axes[0].legend(['Dropoff Points', 'Pickup Points'])
axes[0].set_xlabel("\nTraffic in the Year 2012 \n(Black --> Dropoff Points, Red --> Pickup Points)")
axes[0].grid(False)
days_list = {'monday': 0, 'tuesday': 1, 'wednesday': 2, 'thursday': 3, 'friday': 4, 'saturday': 5, 'sunday': 6}
day_names = list(days_list.keys())
# value_counts() sorts by frequency, so map each weekday code back to its name
weeklyTraffic = year2012_insight['weekday'].value_counts()
axes[1].pie(weeklyTraffic.values, labels=[day_names[d] for d in weeklyTraffic.index], autopct="%.2f%%", explode=[0.1, 0.1, 0.1, 0, 0, 0, 0], colors=palette)
axes[1].set_xlabel("\nShare of rides by weekday, 2012")
plt.show()
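A pitfall worth noting here: `value_counts()` returns counts sorted by frequency, not by weekday index, so pie labels must be aligned with the counts (or the counts reindexed into weekday order) before plotting. A sketch with hypothetical weekday codes:

```python
import pandas as pd

days = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]
weekday_col = pd.Series([4, 4, 0, 6, 4, 0])  # hypothetical weekday codes

counts = weekday_col.value_counts()               # sorted by frequency: 4, 0, 6
ordered = counts.reindex(range(7), fill_value=0)  # index 0..6 = Mon..Sun
print(dict(zip(days, ordered)))
```

With `reindex`, the counts line up with the Monday-to-Sunday label order regardless of which day was busiest.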
from sklearn.cluster import KMeans
loc_df = pd.DataFrame()
loc_df['longitude'] = year2012_insight.dropoff_longitude
loc_df['latitude'] = year2012_insight.dropoff_latitude
kmeans = KMeans(n_clusters=15, random_state=2, n_init = 10).fit(loc_df)
loc_df['label'] = kmeans.labels_
plt.figure(figsize = (10, 10))
for label in loc_df.label.unique():
plt.plot(loc_df.longitude[loc_df.label == label],loc_df.latitude[loc_df.label == label],'.', alpha = 1, markersize = 0.8)
plt.title('Clusters of New York dropoff points, 2012')
plt.show()
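Once fitted, the KMeans object can do more than label the training points: `cluster_centers_` holds the centroid of each cluster, and `predict()` assigns new coordinates to the nearest centroid. A minimal sketch on hypothetical 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical points forming two obvious groups
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
km = KMeans(n_clusters=2, random_state=2, n_init=10).fit(pts)

# predict() maps new coordinates to the nearest cluster centre
same_cluster = km.predict([[0.05, 0.0]])[0] == km.predict([[0.0, 0.1]])[0]
print(bool(same_cluster))
```

For the taxi data, `kmeans.predict` applied to a new dropoff coordinate would tell us which of the 15 New York regions it falls into.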
Let's see the peak days with their rush hours in 2012¶
def visualize_peakDaysF(day, color='r'):
    day_insight = year2012_insight[year2012_insight["weekday"] == day]
    day_name = list(days_list.keys())[day]
    plt.figure(figsize=(15, 70))
    max_pickup, max_pgcnt = 0, 0
    for hrs in range(24):
        hour_traffic = day_insight[day_insight['hour'] == hrs]
        pickup = len(hour_traffic)
        pgn_cnt = hour_traffic["passenger_count"].sum()
        max_pickup = max(max_pickup, pickup)
        max_pgcnt = max(max_pgcnt, pgn_cnt)
        longitude = list(hour_traffic.pickup_longitude) + list(hour_traffic.dropoff_longitude)
        latitude = list(hour_traffic.pickup_latitude) + list(hour_traffic.dropoff_latitude)
        plt.subplot(24, 6, hrs + 1)
        plt.title("\nHour: " + str(hrs) + " [pickup=" + str(pickup) + ",\npassengers count=" + str(pgn_cnt) + "] ", fontsize=12)
        plt.grid(False)
        plt.xticks([])
        plt.yticks([])
        plt.plot(longitude, latitude, '.', alpha=0.6, markersize=10, color=color)
    plt.suptitle("\n" + day_name.capitalize() + " (max pickups=" + str(max_pickup) + ", max passengers=" + str(max_pgcnt) + ")\n\n\n\n\n\n", fontsize=20)
    plt.tight_layout()
    plt.show()
Visualize the rush hours for Monday¶
visualize_peakDaysF(0, color='#4856fb')
Visualize the rush hours for Tuesday¶
visualize_peakDaysF(1, color='#10a2f0')
Visualize the rush hours for Sunday¶
Now, let's see the rush hours of Sunday, the day with the lowest traffic in 2012
visualize_peakDaysF(6, color='#ffa256')
Histogram plot for the distances travelled¶
year2012_insight.distance.hist(bins=30, figsize=(15,4), color='#20beff')
plt.xlabel("Distance (miles)")
plt.title("Histogram of distances")
plt.show()
This histogram shows that most of the rides taken were short rides.
year2012_insight.groupby('passenger_count')[['distance', 'fare_amount']].mean()
| passenger_count | distance | fare_amount |
|---|---|---|
| 0 | 1.719203 | 8.408661 |
| 1 | 1.716163 | 9.625725 |
| 2 | 1.746542 | 9.949899 |
| 3 | 1.736295 | 9.793232 |
| 4 | 1.916772 | 10.491786 |
| 5 | 1.744920 | 9.582713 |
| 6 | 1.775214 | 10.102332 |
print("Average $USD/Mile : {:0.2f}".format(year2012_insight.fare_amount.sum()/year2012_insight.distance.sum()))
Average $USD/Mile : 5.61
Scatter plot visualization between Fare(in $USD) vs Distance(in Miles) of year 2012¶
plt.figure(figsize=(15, 4))
sc = plt.scatter(year2012_insight.fare_amount, year2012_insight.distance, c=year2012_insight.fare_amount,
                 cmap=plt.cm.rainbow, alpha=0.8, s=30, marker=".")
plt.xlabel("Fare (in $USD)")
plt.ylabel("Distance (in Miles)")
plt.title("Scatter plot Fare (in $USD) vs Distance (in Miles)\n")
plt.xlim(0, 60)
plt.grid(False)
plt.colorbar(sc)  # colorbar needs the mappable returned by scatter
plt.show()
Looking at this data, we can say:
- Some trips have zero distance but a fare greater than zero. Maybe these are trips that started and ended in the same place? These fares will be hard to predict because we don't have enough information in the dataset.
- In general, there seems to be a roughly linear relationship between distance and fare.
- Most rides include an initial charge of $2.50.
- It also looks like some riders paid far more than usual (> $120).

Note: the distance in the dataset is calculated in a straight line (point to point). In real life, the road distance is longer.
# removing datapoints with distance < 0.05 miles
print(f"Original size: {len(year2012_insight)}")
train_df = year2012_insight.query("distance >= 0.05")
print(f"New size: {len(train_df)}")
Original size: 14225
New size: 14149
5. Model (for the year 2012 only)¶
Based on the analysis we can build a baseline model. We will try a Linear Regression model, XGBoost regression, Decision Tree regression, Random Forest regression, and LightGBM.
model_data = train_df[['year', 'hour', 'distance', 'passenger_count', 'fare_amount']]
model_data.head()
| | year | hour | distance | passenger_count | fare_amount |
|---|---|---|---|---|---|
| 3 | 2012 | 4 | 1.739386 | 1 | 7.7 |
| 6 | 2012 | 20 | 0.966733 | 1 | 7.5 |
| 7 | 2012 | 17 | 2.582073 | 1 | 16.5 |
| 8 | 2012 | 13 | 0.778722 | 1 | 9.0 |
| 10 | 2012 | 7 | 0.854123 | 1 | 5.3 |
Train, test split of the datasets¶
X = model_data[['year', 'hour', 'distance', 'passenger_count']]
y = model_data[['fare_amount']]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
Linear Regression Run¶
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
model_lin = Pipeline((
("standard_scaler", StandardScaler()),
("lin_reg", LinearRegression()),
))
model_lin.fit(X_train, y_train)
Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('lin_reg', LinearRegression())])
from sklearn.metrics import r2_score
y_test_pred = model_lin.predict(X_test)
score = r2_score(y_test, y_test_pred)
print("R² score on the test set: {:.2f}".format(score))
R² score on the test set: 0.78
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model_lin, X_train, y_train, cv=5, scoring='r2')
print("Cross-validation scores (R^2):", cv_scores)
print("Mean R^2:", cv_scores.mean())
print("Standard deviation of R^2:", cv_scores.std())
Cross-validation scores (R^2): [0.72157324 0.7675404  0.75781654 0.69141379 0.72282491]
Mean R^2: 0.7322337751726523
Standard deviation of R^2: 0.027457171287698062
## This function automates the process of evaluating the models
def evaluate_models(model):
    y_test_pred = model.predict(X_test)
    score = r2_score(y_test, y_test_pred)
    print("R² score on the test set: {:.2f}".format(score))
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    print("Cross-validation scores (R^2):", cv_scores)
    print("Mean R^2:", cv_scores.mean())
    print("Standard deviation of R^2:", cv_scores.std())
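R² alone says little about the size of the errors in dollars. A hedged sketch of how MAE and RMSE (both in the fare's own units) could complement the evaluation, shown here with hypothetical predictions rather than the real models:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true fares and predictions, just to illustrate the metrics
y_true = np.array([4.5, 16.9, 5.7, 7.7])
y_pred = np.array([5.0, 15.0, 6.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # version-portable RMSE
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2_score(y_true, y_pred):.3f}")
```

These two prints could be dropped into `evaluate_models` alongside the R² report; RMSE penalizes the occasional large fare miss more heavily than MAE does.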
XGBoost Run¶
from xgboost import XGBRegressor
model_xgb = Pipeline((
("standard_scaler", StandardScaler()),
("xgb_reg", XGBRegressor(objective='reg:squarederror', random_state=42)),
))
model_xgb.fit(X_train, y_train)
Pipeline(steps=[('standard_scaler', StandardScaler()),
('xgb_reg',
XGBRegressor(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None,
early_stopping_rounds=None,
enable_categorical=False, eval_metric=None,
feature_types=None, gamma=None, grow_policy=None,
importance_type=None,
interaction_constraints=None, learning_rate=None,
max_bin=None, max_cat_threshold=None,
max_cat_to_onehot=None, max_delta_step=None,
max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan,
monotone_constraints=None, multi_strategy=None,
n_estimators=None, n_jobs=None,
                             num_parallel_tree=None, random_state=42, ...))])
evaluate_models(model_xgb)
R² score on the test set: 0.74
Cross-validation scores (R^2): [0.67019365 0.74424673 0.67675671 0.70065511 0.6861153 ]
Mean R^2: 0.695593499335429
Standard deviation of R^2: 0.026391554233688042
Decision Tree Regression¶
from sklearn.tree import DecisionTreeRegressor
model_tree = Pipeline((
("standard_scaler", StandardScaler()),
("tree_reg", DecisionTreeRegressor(random_state=42)),
))
model_tree.fit(X_train, y_train)
Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('tree_reg', DecisionTreeRegressor(random_state=42))])
evaluate_models(model_tree)
R² score on the test set: 0.61
Cross-validation scores (R^2): [0.42052636 0.58426184 0.41315673 0.52118714 0.50310667]
Mean R^2: 0.4884477476040424
Standard deviation of R^2: 0.06441916539853423
Random Forest Regression¶
from sklearn.ensemble import RandomForestRegressor
model_rf = Pipeline((
("standard_scaler", StandardScaler()),
("rf_reg", RandomForestRegressor(random_state=42, n_estimators=100)),
))
model_rf.fit(X_train, y_train)
Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('rf_reg', RandomForestRegressor(random_state=42))])
evaluate_models(model_rf)
R² score on the test set: 0.75
Cross-validation scores (R^2): [0.67489505 0.73322172 0.71141027 0.68725541 0.70125229]
Mean R^2: 0.7016069482950061
Standard deviation of R^2: 0.02007593953599224
LightGBM regressor¶
from lightgbm import LGBMRegressor
model_lgbm = Pipeline((
("standard_scaler", StandardScaler()),
("lgbm_reg", LGBMRegressor(random_state=42)),
))
model_lgbm.fit(X_train, y_train)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000436 seconds. You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 288
[LightGBM] [Info] Number of data points in the train set: 10611, number of used features: 3
[LightGBM] [Info] Start training from score 9.622901
Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('lgbm_reg', LGBMRegressor(random_state=42))])
evaluate_models(model_lgbm)
R² score on the test set: 0.78
Cross-validation scores (R^2): [0.71743692 0.77811075 0.75262279 0.71828627 0.7383625 ]
Mean R^2: 0.7409638475739545
Standard deviation of R^2: 0.022761280336944113
6. Conclusion¶
After analyzing the data, we built several models to predict the NYC taxi fare based on a few variables. The best prediction results came from the Linear Regression model and LightGBM. The results of these two models are almost identical, and both are lightweight and don't require a lot of computational resources.